Grammars & Parsing
Learnability Matters: Active Learning for Video Captioning
This work focuses on active learning for video captioning. In particular, we propose to address the learnability problem in active learning, which is caused by collective outliers in video captioning and has been neglected in the literature. To start with, we conduct a comprehensive study of collective outliers, exploring their hard-to-learn property and concluding that ground-truth inconsistency is one of the main causes. Motivated by this, we design a novel active learning algorithm that takes three complementary aspects into account: learnability, diversity, and uncertainty. Ideally, learnability is reflected by ground-truth consistency. Under the active learning scenario, where ground truths are not available until humans are involved, we measure the consistency on estimated ground truths, using predictions from off-the-shelf models as approximations to the ground truths. These predictions are further used to estimate sample frequency and reliability, evincing diversity and uncertainty respectively. With the help of our novel caption-wise active learning protocol, our algorithm leverages knowledge from humans in a more effective and intelligent manner. Results on publicly available video captioning datasets with diverse video captioning models demonstrate that our algorithm outperforms SOTA active learning methods by a large margin, e.g., we achieve about 103% of full performance on CIDEr with 25% of human annotations on MSR-VTT.
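To make the scoring concrete, below is a minimal sketch (not the paper's implementation) of how ground-truth consistency could be estimated from off-the-shelf predictions: agreement among several models' captions for the same clip serves as a proxy for learnability, with a simple token-overlap F1 standing in for a proper captioning metric such as CIDEr. All function names, and the weighted-sum combination rule, are hypothetical assumptions.

```python
from itertools import combinations

def token_f1(a, b):
    # Crude stand-in for a captioning metric such as CIDEr.
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    overlap = len(sa & sb)
    p, r = overlap / len(sb), overlap / len(sa)
    return 2 * p * r / (p + r) if p + r else 0.0

def learnability(predictions):
    """Mean pairwise agreement among captions predicted for one clip by
    several off-the-shelf models. High agreement suggests consistent
    (estimated) ground truths; low agreement flags a likely collective
    outlier."""
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 0.0
    return sum(token_f1(a, b) for a, b in pairs) / len(pairs)

def acquisition_score(predictions, diversity, uncertainty, w=(1.0, 1.0, 1.0)):
    # Hypothetical combination: the abstract does not specify how the three
    # scores are merged, so a weighted sum is used here for illustration.
    return (w[0] * learnability(predictions)
            + w[1] * diversity
            + w[2] * uncertainty)
```

For instance, learnability(["a man is cooking", "a man cooks food"]) is high, while adding an unrelated prediction such as "a person plays guitar" drags the score down, marking the clip as harder to learn.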
A Simulating

Below is an FGG for derivations of a PCFG in Chomsky normal form. The largest right-hand side has 3 variables, so k = 2. The variables range over nonterminals, so m = |N|, where N is the CFG's nonterminal alphabet.

B.1 Plate diagrams

Plate diagrams are extensions of graphs that describe repeated structure in Bayesian networks (Buntine, 1994) or factor graphs (Obermeyer et al., 2019). A plate is a subset of variables/factors, together with a count M, indicating that the variables/factors inside the plate are to be replicated M times. However, there cannot be edges between different instances of a plate.
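As a concrete illustration of plate semantics, the sketch below (a hypothetical helper, not from the paper) replicates a plate's variables and factors M times; because every replicated factor references only variables from its own copy, no edges can connect different instances.

```python
def expand_plate(variables, factors, M):
    """Instantiate a plate by replicating its variables/factors M times.

    variables: names of the variables inside the plate.
    factors: list of (factor_name, [variable_names]) whose edges reference
    only in-plate variables, so distinct instances never share an edge.
    """
    out_vars, out_factors = [], []
    for i in range(M):
        out_vars += [f"{v}_{i}" for v in variables]
        out_factors += [(f"{f}_{i}", [f"{v}_{i}" for v in vs])
                        for f, vs in factors]
    return out_vars, out_factors

# e.g. expand_plate(["X"], [("emit", ["X"])], 3)
# -> (["X_0", "X_1", "X_2"],
#     [("emit_0", ["X_0"]), ("emit_1", ["X_1"]), ("emit_2", ["X_2"])])
```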
Factor Graph Grammars
We propose the use of hyperedge replacement graph grammars for factor graphs, or factor graph grammars (FGGs) for short. FGGs generate sets of factor graphs and can describe a more general class of models than plate notation, dynamic graphical models, case-factor diagrams, and sum-product networks can. Moreover, inference can be done on FGGs without enumerating all the generated factor graphs. For finite variable domains (but possibly infinite sets of graphs), a generalization of variable elimination to FGGs allows exact and tractable inference in many situations. For finite sets of graphs (but possibly infinite variable domains), an FGG can be converted to a single factor graph amenable to standard inference techniques.
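The following toy sketch (all rule, variable, and factor names are hypothetical, and the attachment machinery is simplified to a single variable per nonterminal edge) conveys the flavor of FGG generation: a nonterminal hyperedge is repeatedly rewritten into a fragment of fresh variables and factors, so one grammar yields an infinite family of chain-structured factor graphs, one per derivation.

```python
import random

def derive(max_steps=50, p_stop=0.5):
    """Sample one factor graph from an HMM-like toy grammar by repeatedly
    rewriting the dangling nonterminal edge until none remains."""
    counter = [0]
    def fresh():
        counter[0] += 1
        return f"T{counter[0]}"

    variables = [fresh()]                     # first state variable
    factors = [("init", (variables[0],))]     # factor on the first state
    nonterminals = [("X", variables[0])]      # one dangling nonterminal edge

    while nonterminals and counter[0] < max_steps:
        _, attach = nonterminals.pop()
        if random.random() < p_stop:
            # stopping rule: rewrite X into a final factor only
            factors.append(("final", (attach,)))
        else:
            # continuation rule: fresh variable, transition factor, recurse
            t = fresh()
            variables.append(t)
            factors.append(("transition", (attach, t)))
            nonterminals.append(("X", t))
    return variables, factors

# Each call returns a chain of different length; the grammar compactly
# describes the whole (infinite) set of such factor graphs.
```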
Appendix: Structured Reordering for Modeling Latent Alignments in Sequence Transduction

…the inside weight for each segment, which is the total weight of all derivations with root X.
WCFG to PCFG Conversion

The algorithm for converting a WCFG to its equivalent PCFG is shown in Algorithm 1; a simplified sketch of the idea follows this section. A full proof of this equivalence can be found in Smith and Johnson [1].

Proof of the Dynamic Programming for Marginal Inference

We prove the correctness of the dynamic programming algorithm for computing the marginal permutation matrix of separable permutations by induction, as follows. As the base case, each word (i.e., each segment of length 1) is associated with the identity permutation matrix 1.

Architecture and Hyperparameters

The detailed architecture of ReMoto, our seq2seq model for semantic parsing, is shown in Figure 1 (view in color). First, the structured reordering module generates a (relaxed) permutation matrix given the input utterance. Then, the encoding module generates representations of the input utterance based on the reordered embeddings, which are computed from the original embeddings and the permutation matrix produced in the first step.
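As the promised sketch of the conversion (a simplified rendering of the idea behind Algorithm 1, not its exact statement): each WCFG rule weight is renormalized by inside weights Z(X), the total weight of all derivations rooted at X, computed here by fixed-point iteration under the assumption that the grammar's weights make it converge. The rule-table format is hypothetical.

```python
def inside_weights(rules, iters=100):
    """Z[X] = total weight of derivations rooted at X, by fixed-point
    iteration (assumes convergence). Terminals contribute weight 1."""
    Z = {x: 0.0 for x in rules}
    for _ in range(iters):
        for x, prods in rules.items():
            total = 0.0
            for rhs, w in prods:
                prod = w
                for sym in rhs:
                    prod *= Z[sym] if sym in rules else 1.0
                total += prod
            Z[x] = total
    return Z

def wcfg_to_pcfg(rules):
    """p(X -> alpha) = w(X -> alpha) * prod_{Y in alpha} Z(Y) / Z(X)."""
    Z = inside_weights(rules)
    pcfg = {}
    for x, prods in rules.items():
        pcfg[x] = []
        for rhs, w in prods:
            p = w / Z[x]
            for sym in rhs:
                if sym in rules:
                    p *= Z[sym]
            pcfg[x].append((rhs, p))
    return pcfg

# e.g. rules = {"S": [(("S", "S"), 0.1), (("a",), 1.0)]}
# Z(S) solves Z = 0.1*Z^2 + 1, giving Z ~ 1.127; the resulting rule
# probabilities 0.1*Z ~ 0.113 and 1/Z ~ 0.887 sum to 1 as required.
```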
Language Through a Prism: A Spectral Approach for Multiscale Language Representations
Language exhibits structure at different scales, ranging from subwords to words, sentences, paragraphs, and documents. To what extent do deep models capture information at these scales, and can we force them to better capture structure across this hierarchy? We approach this question by focusing on individual neurons, analyzing the behavior of their activations at different timescales. We show that signal processing provides a natural framework for separating structure across scales, enabling us to 1) disentangle scale-specific information in existing embeddings and 2) train models to learn more about particular scales. Concretely, we apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part of speech tagging (word-level), dialog speech acts classification (utterance-level), or topic classification (document-level), while performing poorly on the other tasks. We also present a prism layer for training models, which uses spectral filters to constrain different neurons to model structure at different scales. Our proposed BERT + Prism model can better predict masked tokens using long-range context and produces multiscale representations that perform better at utterance-and document-level tasks. Our methods are general and readily applicable to other domains besides language, such as images, audio, and video.
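A minimal sketch of the filtering step (the helper name and band parameterization are illustrative assumptions, not the paper's exact configuration): each neuron's activation sequence across the input is transformed with an FFT, frequencies outside the chosen band are zeroed, and the signal is transformed back, so low bands keep slow document-scale variation and high bands keep fast token-scale variation.

```python
import numpy as np

def spectral_filter(activations, low, high):
    """Band-pass filter each neuron's activation sequence.

    activations: (T, d) array -- one row per token, one column per neuron.
    low, high: band limits as fractions of the one-sided spectrum, e.g.
    (0.0, 0.1) keeps only slow, document-scale variation.
    """
    T = activations.shape[0]
    freqs = np.fft.rfftfreq(T)              # normalized frequencies in [0, 0.5]
    spec = np.fft.rfft(activations, axis=0) # transform along the token axis
    mask = (freqs >= low * 0.5) & (freqs <= high * 0.5)
    spec[~mask] = 0.0                       # zero frequencies outside the band
    return np.fft.irfft(spec, n=T, axis=0)  # back to the token domain
```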
Parameterizing Context: Unleashing the Power of Parameter-Efficient Fine-Tuning and In-Context Tuning for Continual Table Semantic Parsing
Yongrui Chen, Guilin Qi
Continual table semantic parsing aims to train a parser on a sequence of tasks, where each task requires the parser to translate natural language into SQL based on task-specific tables but offers only limited training examples. Conventional methods tend to suffer from overfitting under limited supervision, as well as catastrophic forgetting due to parameter updates. Despite recent advancements that partially alleviate these issues through semi-supervised data augmentation and retention of a few past examples, performance is still limited by the volume of unsupervised data and stored examples. To overcome these challenges, this paper introduces a novel method integrating parameter-efficient fine-tuning (PEFT) and in-context tuning (ICT) for training a continual table semantic parser. First, we present a task-adaptive PEFT framework capable of fully circumventing catastrophic forgetting, achieved by freezing the pre-trained model backbone and fine-tuning small-scale prompts. Building on this, we propose a solution based on a teacher-student framework. The teacher addresses the few-shot problem using ICT, which procures contextual information by demonstrating a few training examples. In turn, the student leverages the proposed PEFT framework to learn from the teacher's output distribution, then compresses and saves the contextual information into the prompts, eliminating the need to store any training examples.
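A minimal sketch of the two ingredients, assuming a HuggingFace-style encoder exposing config.hidden_size and accepting inputs_embeds (both assumptions; the class and function names are hypothetical): the backbone is frozen so later tasks cannot overwrite it, a small per-task prompt is the only trainable tensor, and the student matches the ICT teacher's output distribution via a temperature-scaled KL loss.

```python
import torch
import torch.nn.functional as F

class PromptTunedParser(torch.nn.Module):
    def __init__(self, backbone, prompt_len=20):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # frozen backbone: no forgetting
        d = backbone.config.hidden_size
        # one small trainable prompt per task; only this tensor is updated
        self.prompt = torch.nn.Parameter(torch.randn(prompt_len, d) * 0.02)

    def forward(self, input_embeds):
        # prepend the task prompt to the (frozen) input embeddings
        prompt = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return self.backbone(inputs_embeds=torch.cat([prompt, input_embeds], 1))

def distill_loss(student_logits, teacher_logits, T=2.0):
    # the student matches the ICT teacher's output distribution
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
```

Because only the prompt receives gradients, the context demonstrated to the teacher is effectively compressed into a few thousand parameters rather than stored as raw examples.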
MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing Supplementary Material
VLM Evaluation

To evaluate two VLMs (Frozen in Time [1] and VideoCLIP [13]), we use a hybrid approach that leverages both prototypical networks [11] and the video-language similarity metrics learned by each model. Below, we show an ablation study in which we use only the video prototype networks. We show the performance of using only language similarity in the few-shot case to demonstrate the effects of sample removal, and we also show the effects of our hybrid weighting scheme, where we weight the language embeddings five times more than the video embeddings when constructing the hybrid prototype (as opposed to the equal weighting used in the regular hybrid approach). We perform our ablation study with Frozen in Time and use the same weighting scheme and prototype strategy for VideoCLIP. For this study, we report activity and sub-activity classification accuracy in the 5-shot case, and we indicate whether a given method uses language, video, or both to create its prototype embeddings.
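A minimal sketch of the hybrid prototype construction (tensor layout and function names are our assumptions): class prototypes are built from the normalized means of the language and video embeddings, with the language side scaled five times higher for the weighting-scheme ablation, and queries are assigned to the nearest prototype by cosine similarity.

```python
import torch
import torch.nn.functional as F

def hybrid_prototypes(video_embs, lang_embs, labels, n_classes, lang_weight=5.0):
    """Class prototypes as a weighted mean of language and video embeddings.

    lang_weight=5.0 reproduces the 5x language weighting ablated above;
    lang_weight=1.0 gives the regular (equal-weight) hybrid prototype.
    """
    protos = []
    for c in range(n_classes):
        idx = labels == c                      # support examples of class c
        v = F.normalize(video_embs[idx].mean(0), dim=-1)
        l = F.normalize(lang_embs[idx].mean(0), dim=-1)
        protos.append(F.normalize(lang_weight * l + v, dim=-1))
    return torch.stack(protos)

def classify(query_video_emb, protos):
    # assign the query to the nearest prototype by cosine similarity
    q = F.normalize(query_video_emb, dim=-1)
    return (protos @ q).argmax()
```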